1/31/2024
Chapter 2 - Modeling Process
- Data splitting (typical splits are 60% training / 40% testing, 70%/30%, or 80%/20%; see the splitting sketch after the sampling bullets below)
- Sampling methods
  - Simple random sample (you can use base R, the caret package, the rsample package, or the h2o package)
  - Stratified sampling (again, a variety of methods can be used)
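A minimal sketch of a 70/30 split with rsample, both simple random and stratified. The `ames` data frame and its `Sale_Price` column are assumptions here (the Ames housing data the book uses); any data frame with a numeric response works the same way.

```r
library(rsample)

set.seed(123)  # make the split reproducible

# Simple random sample: 70% training / 30% testing
split_simple <- initial_split(ames, prop = 0.7)
train_simple <- training(split_simple)
test_simple  <- testing(split_simple)

# Stratified sample: sample within quantiles of the response so the
# training and test sets have similar Sale_Price distributions
split_strat <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
train_strat <- training(split_strat)
test_strat  <- testing(split_strat)
```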
- Formula interfaces (the R formula interface is common, but there are other specification styles; make sure you understand what a given function expects before running the R code)
- Different functions use different engines - e.g., lm(), glm(), and train() each expect a slightly different specification (see the sketch below)
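A rough sketch of the same linear model specified through three different engines. The predictor names (`Gr_Liv_Area`, `Year_Built`) and the `train_strat` data frame carried over from the split sketch above are illustrative assumptions.

```r
library(caret)

# Base R least squares, formula interface
fit_lm  <- lm(Sale_Price ~ Gr_Liv_Area + Year_Built, data = train_strat)

# glm() takes the same formula, but the error distribution is stated explicitly
fit_glm <- glm(Sale_Price ~ Gr_Liv_Area + Year_Built,
               data = train_strat, family = gaussian())

# caret::train() wraps many engines; the engine is chosen with `method`
fit_caret <- train(Sale_Price ~ Gr_Liv_Area + Year_Built,
                   data = train_strat, method = "lm")
```

The formula is the same in all three calls; what differs is the extra arguments each function needs, so check the help page before fitting.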
- Resampling methods (a resampling sketch follows this group)
  - k-fold cross-validation
  - Bootstrapping
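A minimal sketch of both resampling schemes, again assuming the `train_strat` data frame from above. rsample returns resample objects you iterate over yourself, while caret's `trainControl()` object is passed to `train()`.

```r
library(rsample)
library(caret)

set.seed(123)

# 10-fold cross-validation and 100 bootstrap resamples with rsample
cv_folds <- vfold_cv(train_strat, v = 10)
boots    <- bootstraps(train_strat, times = 100)

# The same resampling strategies expressed for caret::train()
cv_ctrl   <- trainControl(method = "cv", number = 10)
boot_ctrl <- trainControl(method = "boot", number = 100)
```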
- Bias-variance trade-off
- Hyperparameter tuning - more on this later when we get to the detailed methods
- Model evaluation (worked sketches of these metrics follow this list)
  - Regression models
    - MSE - mean squared error
    - RMSE - root mean squared error
    - Deviance - mean residual deviance
    - MAE - mean absolute error
    - RMSLE - root mean squared logarithmic error
    - R^2 - coefficient of determination
  - Classification models
    - Misclassification - overall error rate
    - Mean per class error
    - MSE - mean squared error
    - Cross entropy - log loss or deviance
    - Gini index - mainly used with tree-based methods
    - Accuracy
    - Precision
    - Sensitivity - recall
    - Specificity
    - AUC - area under the ROC curve
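Worked versions of the regression metrics, hand-rolled so the formulas are visible. `y` (observed values) and `pred` (model predictions) are assumed numeric vectors, non-negative where RMSLE is used.

```r
mse   <- mean((y - pred)^2)                           # mean squared error
rmse  <- sqrt(mse)                                    # root mean squared error
mae   <- mean(abs(y - pred))                          # mean absolute error
rmsle <- sqrt(mean((log1p(pred) - log1p(y))^2))       # root mean squared log error
r2    <- 1 - sum((y - pred)^2) / sum((y - mean(y))^2) # coefficient of determination
# For a Gaussian model, the mean residual deviance is the same quantity as MSE.
```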
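A similar sketch for the classification metrics, built from a 2x2 confusion matrix. `truth` and `pred_class` are assumed factors with levels "no"/"yes", `pred_prob` is the predicted probability of "yes", and the pROC package used for AUC is also an assumption.

```r
cm <- table(Predicted = pred_class, Actual = truth)
TP <- cm["yes", "yes"]; TN <- cm["no", "no"]
FP <- cm["yes", "no"];  FN <- cm["no", "yes"]

accuracy    <- (TP + TN) / sum(cm)  # overall fraction correct
misclass    <- 1 - accuracy         # misclassification (overall error) rate
precision   <- TP / (TP + FP)       # share of predicted "yes" that were right
sensitivity <- TP / (TP + FN)       # recall / true positive rate
specificity <- TN / (TN + FP)       # true negative rate

# Cross entropy (log loss)
logloss <- -mean(ifelse(truth == "yes", log(pred_prob), log(1 - pred_prob)))

# AUC - area under the ROC curve, via pROC
library(pROC)
auc_value <- auc(roc(truth, pred_prob))
```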
Reading assignment - Please read Chapter 3 - Feature & Target Engineering.